Abstract

By late 2004, RTExpress™, a compiler and runtime
environment that allows MATLAB® script files to be
compiled directly and then executed on parallel high
performance computers (HPC), will be released for the
SGI platform, including the new Altix Itanium systems. [1]
This new version of RTExpress™ will take advantage of the
SGI hardware and software package, specifically 64-bit
operation, the SGI MPT (Message Passing Toolkit), and the
SGI SCSL (Scientific Computing Software Library). It is
also the first version of RTExpress™ to utilize the
advantages of a global shared-memory system. This paper
presents the first test results using this new release.
Improvement in corner-turn timing is anticipated due to
the SGI NUMAlink interconnect fabric, as compared to other
interconnect technology common in Linux clusters, such as
Ethernet. Up to an order of magnitude improvement in
corner-turn performance and in overall 2D FFT performance
is expected.

1  Introduction 

The RTExpress™ environment is a software tool that
assists a user in rapidly developing real-time
embedded systems. RTExpress™ is a compiler and
runtime environment that allows MATLAB® script files
to be compiled directly and then executed on parallel
high performance computers (HPC). It lets the user
employ the power of an HPC on standard MATLAB® code
without having to recode the MATLAB® in the HPC
target language. Its features include support for
real-time data and machine performance visualization,
multiple parallelization paradigms, multiple
homogeneous parallel architectures, utilization of
machine-specific optimized vector libraries and native
compilers, and the ability to change real-time
algorithm parameters on-the-fly. [2]

The SGI Scientific Computing Software Library
(SCSL) consists of several standard and proprietary
scientific and math functions, optimized for use on the
SGI platforms. Included in this package are BLAS
(Basic Linear Algebra Subprograms) and LAPACK
(Linear Algebra Package) libraries. The SCSL library
supports 64-bit integer arguments, single and double
precision, and real and complex data types. [3]
RTExpress™ implementations have always taken
advantage of vendor-supplied libraries, where possible,
to fully exploit the target processing capabilities.

The  SGI  Message  Passing  Toolkit  (MPT)  combines 
the  standard  Message  Passing  Interface  (MPI),  which 
is  utilized  by  RTExpress™,  with  the  SHMEM 
Library,  which  extends  the  interprocessor 
communication  for  shared  memory  systems.  [4]  The 
MPT  facilities  are  a  key  element  for  taking  full 
advantage  of  the  SGI  NUMAlink  Interconnect  fabric. 
The  Altix  system  combines  the  NUMAflex  system 
architecture  with  the  standard  components,  including 
the  Intel  Itanium  2  and  the  fully  supported,  64-bit 
Linux  operating  system.  [5] 

The  development  of  the  RTExpress™  environment 
was  funded  under  DARPA/ITO  BAA  95-19. 

2  Performance  Results 

A MATLAB script performs the 2D complex FFT:

init_matrix = ones(fftsize, fftsize) + j * ones(fftsize, fftsize)

loop
    store time t1
    a = fft(init_matrix)
    store time t2
    a = a'
    store time t3
    a = fft(a)
    store time t4
end loop

Elapsed times are computed, averaging time over
several iterations.
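The script relies on the fact that a 2D FFT can be built from column-wise 1D FFTs separated by a transpose. The following is a minimal Python sketch of that decomposition, not RTExpress or MATLAB code: the function names are illustrative, a naive DFT stands in for an optimized FFT, and the transpose is a plain (non-conjugating) one. MATLAB's `'` operator also conjugates complex data, which does not affect the timing but does change the numerical result.

```python
import cmath

def dft(vec):
    """Naive O(n^2) DFT of one column (stand-in for an optimized FFT)."""
    n = len(vec)
    return [sum(vec[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def fft_columns(m):
    """Transform every column of a row-major matrix, as fft(m) does in MATLAB."""
    cols = [dft(list(col)) for col in zip(*m)]
    return [list(row) for row in zip(*cols)]

def transpose(m):
    """Plain (non-conjugating) transpose, i.e. the corner turn."""
    return [list(row) for row in zip(*m)]

def fft2_by_corner_turn(m):
    """Column FFT, corner turn, column FFT; the result is the transposed 2D DFT."""
    return fft_columns(transpose(fft_columns(m)))

def dft2_direct(m):
    """2D DFT straight from the definition, used only as a reference."""
    rows, cols = len(m), len(m[0])
    return [[sum(m[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)] for u in range(rows)]

fftsize = 4  # small size chosen only to keep the naive DFT fast
init_matrix = [[1 + 1j] * fftsize for _ in range(fftsize)]
a = fft2_by_corner_turn(init_matrix)
```

Because the benchmark never transposes back, `a` holds the 2D transform in transposed order; a second corner turn would be needed to restore the original layout.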

The operation begins with initialization of a
MATLAB matrix. Using RTExpress, the script is
mapped, using the graphic tool "mapit", to several
compute elements for a data-parallel operation. In this
manner, the columns of the data matrix are distributed
to the compute elements, thereby giving each compute
element only a portion of the total number of columns
to operate on. The resulting matrix is then transposed
and redistributed. This operation, the corner turn, is
typically the limiting operation for 2D FFT
performance. Lastly, the new columns are again
operated on by the compute elements, this time
overwriting the input matrix. This in-place FFT should
perform slightly faster since the step to copy the input
data is not required.

The 2D FFT benchmark has been run on several
platforms and interconnects, and Figure 1 shows some
of the results taken for a single-precision, complex 1k
by 1k data set. Results have also been collected when
running in double precision. The corner turn
performance (Fig 1b) provides an indication of the
inter-processor communication capabilities of a
particular system. To date, most systems have shown
excellent scalable results for FFT performance (Fig 1a
and Fig 1c); however, the total 2D FFT (Fig 1d) is
limited by the performance of the corner turn. On most
Linux clusters utilizing standard 100base-T Ethernet,
interconnect performance dominates the overall
2D FFT timing. Systems utilizing Myrinet™ or
DolphiNet™ interconnects show a notable
improvement over standard 100base-T. The timings
from the SGI system will be compared to data
displayed in Figure 1.

Figure 1 - RTExpress 2D FFT Performance Timing
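The column distribution and corner turn described above can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the RTExpress implementation: the function names are hypothetical, each compute element is modelled as a Python list holding a contiguous block of columns, and the matrix dimensions are assumed to divide evenly by the number of elements.

```python
def distribute_columns(matrix, nproc):
    """Give each of nproc compute elements a contiguous block of columns."""
    ncols = len(matrix[0])
    assert ncols % nproc == 0, "sketch assumes columns divide evenly"
    width = ncols // nproc
    return [[row[p * width:(p + 1) * width] for row in matrix] for p in range(nproc)]

def corner_turn(local):
    """Transpose a column-distributed matrix and redistribute it by columns.

    Returns the new per-element blocks and the number of tiles that had to
    move between different elements.
    """
    nproc = len(local)
    nrows = len(local[0])
    assert nrows % nproc == 0, "sketch assumes rows divide evenly"
    height = nrows // nproc
    messages = 0
    new_local = []
    for dst in range(nproc):
        block = []
        for src in range(nproc):
            # Rows of src's block that become dst's columns after the transpose.
            tile = local[src][dst * height:(dst + 1) * height]
            if src != dst:
                messages += 1  # this tile crosses the interconnect
            block.extend(list(r) for r in zip(*tile))  # transpose the tile
        new_local.append(block)
    return new_local, messages

def gather_columns(local):
    """Reassemble the full matrix from per-element column blocks."""
    return [sum((blk[r] for blk in local), []) for r in range(len(local[0]))]

matrix = [[10 * r + c for c in range(6)] for r in range(4)]
turned, messages = corner_turn(distribute_columns(matrix, 2))
```

In this scheme every element exchanges a tile with every other element, so P elements move P*(P-1) tiles. This is one way to see why the corner turn stresses the interconnect while the column FFTs themselves need no communication.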
Matlab FFT Benchmark Test on Shared Memory SGI


• A Matlab script performs the 2D complex FFT

init_matrix = ones(fftsize, fftsize) + j * ones(fftsize, fftsize)
loop
    store time t1
    a = fft(init_matrix)
    store time t2
    a = a'
    store time t3
    a = fft(a)
    store time t4
end loop

FFT 1 elapsed time = t2 - t1
corner turn elapsed time = t3 - t2
FFT 2 elapsed time = t4 - t3

• RTExpress is used to compile and run the MATLAB script
on the 64-bit parallel computer on varying numbers of
processors in data-parallel mode

• Elapsed times are computed, averaging time over several
iterations

• First iteration is not counted
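The measurement procedure on this slide (four time stamps per pass, per-phase differences, averaging, first pass discarded) can be sketched as a small Python harness. The function name `time_phases`, the `clock` parameter, and the stand-in workloads are assumptions made for illustration; they are not part of RTExpress.

```python
import time

def time_phases(phases, iterations, clock=time.perf_counter):
    """Run the phases in order on every pass and average each phase's
    elapsed time, leaving the first (warm-up) pass out of the average."""
    if iterations < 2:
        raise ValueError("need at least two iterations: the first is discarded")
    totals = [0.0] * len(phases)
    for it in range(iterations):
        stamps = [clock()]              # t1
        for phase in phases:
            phase()
            stamps.append(clock())      # t2, t3, t4
        if it == 0:
            continue                    # first iteration is not counted
        for i in range(len(phases)):
            totals[i] += stamps[i + 1] - stamps[i]
    return [t / (iterations - 1) for t in totals]

# Stand-in workloads; a real run would call the FFT and corner-turn steps.
def workload():
    return sum(range(1000))

avg_fft1, avg_corner_turn, avg_fft2 = time_phases([workload] * 3, iterations=5)
```

Passing the clock as a parameter keeps the harness testable with a scripted clock and makes it easy to swap in a different timer.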
Scaling: 64-Processor Altix

[Figure: four panels plotted against number of processors: 1D FFT, 1D FFT in-place, Corner Turn, Composite 2D FFT Time]

• Cornerturn scaling is excellent up to 50 processors at 8K FFT (complex/single precision)

64p 1.7GHz/9MB cache Altix 3700 (note: production systems are 1.6GHz)
SGI Shared Memory Improves Cornerturn and 2D FFT

[Figure: FFT Timing (1024 x 1024 out of place); Altix 350, 1.4 GHz/3MB L3 cache]


Timing Information NOTES


• Please note that all timing information gathered is not
intended to provide a recommendation for any particular
hardware, but to illustrate parallel operation with various
combinations of processors and interconnect systems

• Equipment used in the following tests may no longer be the
hardware vendor’s current offering

• “Improvement,” or speed-up, as compared to first-processor
performance, is used to examine scaling rather than
absolute timing

• Maximum RTExpress performance may be gained by fully
using vector operations in MATLAB rather than using
sequential loops


Parallel performance is extremely dependent on
a specific application and implementation
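The "improvement" measure in the notes above can be written as S(p) = T(1) / T(p), with parallel efficiency S(p) / p. The Python sketch below assumes that "first-processor performance" means the elapsed time of the one-processor run. The timings in `example` are placeholders invented for illustration and are not measurements from these tests.

```python
def speedup(timings):
    """Map processor count -> T(1) / T(p), given processor count -> elapsed time."""
    if 1 not in timings:
        raise KeyError("need the single-processor timing as the baseline")
    base = timings[1]
    return {p: base / t for p, t in sorted(timings.items())}

def efficiency(timings):
    """Map processor count -> speed-up divided by processor count."""
    return {p: s / p for p, s in speedup(timings).items()}

# Placeholder values (seconds), NOT results from the paper.
example = {1: 8.0, 2: 4.0, 4: 2.5}
example_speedup = speedup(example)
example_efficiency = efficiency(example)
```

Reporting the ratio instead of the raw seconds lets systems with different clock rates be compared on how well they scale, which is the stated purpose of the figures.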
What is RTExpress™

• Development and Runtime Environment allowing MATLAB scripts to
be compiled and executed on real-time/parallel High Performance
Computers (HPC)

- Provides a flexible means to harness the power of an HPC using MATLAB

- User does not require detailed knowledge of parallel programming

- Supports:

• Embedded parallel architectures such as Mercury

• SUN Network of Workstations

• High performance Linux PC Servers

- Support for FPGA functions

• Library of FPGA functions directly callable from MATLAB source

• Now porting to SGI Altix systems

- http://www.sgi.com/newsroom/press_releases/2004/june/altix_tcep.html

- Intel / SGI Development Agreement

- Itanium 64-bit processing


Data Collection - 2D FFT Tests


• FFT Benchmark tests - Computation and Communication

• 1D FFT, transpose (cornerturn), 1D FFT in-place

• Testing on SGI Altix Linux servers with Shared Memory
Interconnect

• Results of Shared Memory interconnect show improved scaling on
cornerturn

[Images: SGI® Altix™ 3000; SGI® NUMAlink™ Interconnect Fabric]

- Shared Memory Architecture

- New RTExpress for SGI release expected by fall/winter of 2004


2D FFT Benchmark Test using RTExpress


• A Matlab script performs the 2D complex FFT

init_matrix = ones(fftsize, fftsize) + j * ones(fftsize, fftsize)
loop
    store time t1
    a = fft(init_matrix)
    store time t2
    a = a'
    store time t3
    a = fft(a)
    store time t4
end loop

• RTExpress is used to run the MATLAB script on varying numbers of
processors in data-parallel mode

• Elapsed times are computed, averaging time over several iterations

• First iteration is not counted


Parallelization with RTExpress™ is Flexible and Efficient




Developing  Parallel  Applications  with 
RTExpress™ 


• Supports MPI


[Diagram labels: Source Code, Linking]


Integrated  Sensors,  Inc.  (315)798-1377  www.sensors.com 


